{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ```read_sql()``` function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The read_sql function issues a SQL query to a specific database and return the result in a Python DataFrame.\n",
    "In this section, we show how to use this feature.\n",
    "\n",
    "## Install `connectorx`\n",
    "\n",
    "Connector wrap up the function on [connectorx](https://github.com/sfu-db/connector-x), in order to enable `read_sql`, you need to first install it by running:\n",
    "\n",
    "```\n",
    "pip install connectorx\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## API\n",
    "\n",
    "```python\n",
    "read_sql(conn: str, query: Union[List[str], str], *, return_type: str = \"pandas\", protocol: str = \"binary\", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)\n",
    "```\n",
    "\n",
    "Run the SQL query, download the data from database into a Pandas dataframe.\n",
    "\n",
    "## Parameters\n",
    "- `conn: str`: Connection string URI.\n",
    "  - General supported URI scheme: `(postgres|postgressql|mysql|mssql)://username:password@addr:port/dbname`.\n",
    "  - For now sqlite only support absolute path, example: `sqlite:///home/user/path/test.db`.\n",
    "- `query: Union[str, List[str]]`: SQL query or list of SQL queries for fetching data.\n",
    "- `return_type: str = \"pandas\"`: The return type of this function. It can be `arrow`, `pandas`, `modin`, `dask` or `polars`.\n",
    "- `protocol: str = \"binary\"`: The protocol used to fetch data from source, default is `binary`. Check out [here](https://github.com/sfu-db/connector-x/blob/main/Types.md) to see more details.\n",
    "- `partition_on: Optional[str]`: The column to partition the result.\n",
    "- `partition_range: Optional[Tuple[int, int]]`: The value range of the partition column.\n",
    "- `partition_num: Optioinal[int]`: The number of partitions to generate.\n",
    "\n",
    "## Examples\n",
    "- Read a DataFrame from a SQL using a single thread\n",
    "\n",
    "  ```python\n",
    "  from dataprep.connector import read_sql\n",
    "\n",
    "  postgres_url = \"postgresql://username:password@server:port/database\"\n",
    "  query = \"SELECT * FROM lineitem\"\n",
    "\n",
    "  read_sql(postgres_url, query)\n",
    "  ```\n",
    "\n",
    "- Read a DataFrame parallelly using 10 threads by automatically partitioning the provided SQL on the partition column (`partition_range` will be automatically  queried if not given)\n",
    "\n",
    "  ```python\n",
    "  from dataprep.connector import read_sql\n",
    "\n",
    "  postgres_url = \"postgresql://username:password@server:port/database\"\n",
    "  query = \"SELECT * FROM lineitem\"\n",
    "\n",
    "  read_sql(postgres_url, query, partition_on=\"l_orderkey\", partition_num=10)\n",
    "  ```\n",
    "\n",
    "- Read a DataFrame parallelly using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be same)\n",
    "\n",
    "  ```python\n",
    "  from dataprep.connector import read_sql\n",
    "\n",
    "  postgres_url = \"postgresql://username:password@server:port/database\"\n",
    "  queries = [\"SELECT * FROM lineitem WHERE l_orderkey <= 30000000\", \"SELECT * FROM lineitem WHERE l_orderkey > 30000000\"]\n",
    "\n",
    "  read_sql(postgres_url, queries)\n",
    "\n",
    "  ```\n",
    "  \n",
    "- Read a DataFrame parallelly using 4 threads from a more complex query\n",
    "\n",
    "  ```python\n",
    "  from dataprep.connector import read_sql\n",
    "\n",
    "  postgres_url = \"postgresql://username:password@server:port/database\"\n",
    "  query = f\"\"\"\n",
    "  SELECT l_orderkey,\n",
    "         SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,\n",
    "         o_orderdate,\n",
    "         o_shippriority\n",
    "  FROM   customer,\n",
    "         orders,\n",
    "         lineitem\n",
    "  WHERE  c_mktsegment = 'BUILDING'\n",
    "         AND c_custkey = o_custkey\n",
    "         AND l_orderkey = o_orderkey\n",
    "         AND o_orderdate < DATE '1995-03-15'\n",
    "         AND l_shipdate > DATE '1995-03-15'\n",
    "  GROUP  BY l_orderkey,\n",
    "            o_orderdate,\n",
    "            o_shippriority \n",
    "  \"\"\"\n",
    "\n",
    "  read_sql(postgres_url, query, partition_on=\"l_orderkey\", partition_num=4)\n",
    "\n",
    "  ```"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
